Generalist models, which perform diverse multi-modal tasks in a task-agnostic way within a single model, have recently been explored. As a hopeful route toward general-purpose AI, existing generalist models are still at an early stage, with limited modality and task coverage. To enable multi-modal task scaling and speed up this line of research, we release OFASys, a generalist model learning system built on a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively, even in a single line of code. The system automatically generates task plans from such instructions for training and inference, and it facilitates multi-task training over diverse multi-modal workloads. As a starting point, OFASys provides presets of 7 modalities and 23 highly diverse example tasks, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% of the performance of 15 task-finetuned models on average with only 16% of their parameters, showcasing the reliability of multi-modal task scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
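The declarative interface can be pictured with a small sketch: a one-line instruction with modality-typed slots is parsed into a task plan of input and output slots. The `[MODALITY:name]` slot syntax and the `parse_instruction` helper below are illustrative assumptions modeled on the abstract's description, not OFASys's actual API.

```python
import re

def parse_instruction(instruction):
    """Parse a declarative multi-modal instruction of the (assumed) form
    '[MODALITY:name] plain text ... -> [MODALITY:name]' into a simple
    task plan listing input slots, output slots, and the prompt template."""
    source, _, target = instruction.partition("->")
    slot = re.compile(r"\[([A-Z]+):(\w+)\]")
    return {
        "inputs":  [{"modality": m, "name": n} for m, n in slot.findall(source)],
        "outputs": [{"modality": m, "name": n} for m, n in slot.findall(target)],
        "template": slot.sub("<slot>", source).strip(),
    }

# One line declares an image-captioning task: an image slot in, a text slot out.
plan = parse_instruction("[IMAGE:img] what does the image describe? -> [TEXT:cap]")
```

The same parser would turn, say, a speech-recognition instruction with an `[AUDIO:wav]` input slot into an analogous plan, which is the sense in which the task definition is decoupled from any model implementation.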
Neural radiance fields (NeRF) achieve highly photo-realistic novel-view synthesis, but editing the scenes modeled by NeRF-based methods remains challenging, especially for dynamic scenes. We propose editable neural radiance fields that enable end-users to easily edit dynamic scenes and even support topological changes. Taking an image sequence from a single camera as input, our network is trained fully automatically and models topologically varying dynamics using our picked-out surface key points. End-users can then edit the scene by simply dragging the key points to desired new positions. To achieve this, we propose a scene analysis method that detects and initializes key points by considering the dynamics in the scene, and a weighted key points strategy that models topologically varying dynamics through joint optimization of key points and weights. Our method supports intuitive multi-dimensional (up to 3D) editing and can generate novel scenes unseen in the input sequence. Experiments demonstrate that our method achieves high-quality editing on various dynamic scenes and outperforms the state of the art. We will release our code and captured data.
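The drag-based editing idea can be illustrated with a generic stand-in: every scene point is displaced by a distance-weighted blend of the key-point displacements. The inverse-distance weighting below is an assumed, simplified scheme for illustration; the paper optimizes its weights jointly with the key points rather than fixing them by a formula.

```python
import math

def drag_edit(points, keypoints, new_keypoints, power=2.0, eps=1e-8):
    """Move scene points by an inverse-distance-weighted blend of key-point
    displacements -- a hypothetical sketch of weighted key-point editing."""
    deltas = [tuple(n - o for n, o in zip(nk, ok))
              for nk, ok in zip(new_keypoints, keypoints)]
    edited = []
    for p in points:
        # Closer key points get larger (normalized) weights.
        w = [1.0 / (math.dist(p, k) ** power + eps) for k in keypoints]
        s = sum(w)
        disp = tuple(sum(wi * d[axis] for wi, d in zip(w, deltas)) / s
                     for axis in range(len(p)))
        edited.append(tuple(pi + di for pi, di in zip(p, disp)))
    return edited

# Dragging the single key point by +1 in x carries a coincident point with it.
pts = drag_edit([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```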
This paper reviews the Challenge on Super-Resolution of Compressed Image and Video at AIM 2022. The challenge includes two tracks. Track 1 targets the super-resolution of compressed images, and Track 2 targets the super-resolution of compressed videos. In Track 1, we use the popular dataset DIV2K as the training, validation, and test sets. In Track 2, we propose the LDV 3.0 dataset, which contains 365 videos, including the LDV 2.0 dataset (335 videos) and 30 additional videos. In this challenge, 12 teams and 2 teams submitted final results for Track 1 and Track 2, respectively. The proposed methods and solutions gauge the state of the art of super-resolution on compressed images and videos. The proposed LDV 3.0 dataset is available at https://github.com/renyang-home/ldv_dataset. The homepage of this challenge is at https://github.com/renyang-home/aim22_compresssr.
Video semantic segmentation (VSS) is beneficial for dealing with dynamic scenes due to the continuous nature of the real-world environment. Some methods alleviate the problem of inconsistent predictions between consecutive frames, while others employ the previous frame as prior information to assist in segmenting the current frame. Although these methods achieve superior performance on independent and identically distributed (i.i.d.) data, they cannot generalize well to other, unseen domains. We therefore explore a new task, video generalizable semantic segmentation (VGSS), which considers both continuous frames and domain generalization. In this paper, we propose a class-wise non-salient region generalized (CNSG) framework for the VGSS task. Concretely, we first define the class-wise non-salient feature, which describes features of the class-wise non-salient region that carry more generalizable information. We then propose a class-wise non-salient feature reasoning strategy to adaptively select and enhance the most generalizable channels. Finally, we propose an inter-frame non-salient centroid alignment loss to alleviate prediction inconsistency in the VGSS task. We also extend our video-based framework to the image-based generalizable semantic segmentation (IGSS) task. Experiments demonstrate that our CNSG framework yields significant improvements on the VGSS and IGSS tasks.
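The centroid-alignment idea can be sketched in a few lines: compute a per-class mean feature (centroid) for each of two consecutive frames and penalize the distance between matching centroids. This is a simplified stand-in, not the paper's exact loss, which additionally restricts the computation to non-salient regions.

```python
def class_centroids(features, labels):
    """Average the per-pixel feature vectors belonging to each class."""
    sums, counts = {}, {}
    for f, c in zip(features, labels):
        acc = sums.setdefault(c, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def centroid_alignment_loss(feat_t, lab_t, feat_t1, lab_t1):
    """Mean squared distance between per-class centroids of two frames,
    averaged over classes present in both -- a sketch of the alignment idea."""
    a = class_centroids(feat_t, lab_t)
    b = class_centroids(feat_t1, lab_t1)
    shared = set(a) & set(b)
    return sum(sum((x - y) ** 2 for x, y in zip(a[c], b[c]))
               for c in shared) / max(len(shared), 1)
```

Minimizing this term pulls the class-conditional feature statistics of consecutive frames together, which is the mechanism the abstract credits for reducing prediction inconsistency.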
Latent factor model estimation typically relies on either using domain knowledge to manually pick several observed covariates as factor proxies, or purely conducting multivariate analysis such as principal component analysis. However, the former approach may suffer from bias, while the latter cannot incorporate additional information. We propose to bridge these two approaches while allowing the number of factor proxies to diverge, making latent factor model estimation robust, flexible, and statistically more accurate. As a bonus, the number of factors is also allowed to grow. At the heart of our method is a penalized reduced rank regression that combines the two sources of information. To further handle heavy-tailed data, we propose a computationally attractive penalized robust reduced rank regression. We establish faster rates of convergence than the benchmark. Extensive simulations and real examples illustrate the advantages.
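A generic penalized reduced rank regression can be sketched via singular-value soft-thresholding: fit a least-squares coefficient matrix, then shrink its singular values, which both reduces its rank and regularizes it. This is a textbook nuclear-norm-style estimator offered as an assumed illustration, not the paper's exact estimator or its robust variant.

```python
import numpy as np

def reduced_rank_regression(X, Y, tau):
    """Fit Y ~ X B by least squares, then soft-threshold the singular
    values of B by tau -- a generic penalized reduced-rank sketch."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U, s, Vt = np.linalg.svd(B_ols, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # soft-threshold kills small components
    return U @ np.diag(s_shrunk) @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
B_true = np.outer(rng.normal(size=5), rng.normal(size=4))  # rank-1 coefficients
Y = X @ B_true + 0.1 * rng.normal(size=(200, 4))
B_hat = reduced_rank_regression(X, Y, tau=0.5)
```

With a rank-1 truth and small noise, the threshold removes the noise-driven singular values and recovers a rank-1 coefficient estimate.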
Deep convolutional neural networks have proven their effectiveness and are acknowledged as the dominant method for image classification. However, a severe drawback of deep convolutional neural networks is their poor explainability. Unfortunately, in many real-world applications, users need to understand the rationale behind the predictions of deep convolutional neural networks when deciding whether to trust those predictions. To resolve this issue, a novel genetic algorithm-based method is proposed, for the first time, to automatically evolve local explanations that help users assess the rationality of the predictions. Furthermore, the proposed method is model-agnostic, i.e., it can be used to explain any deep convolutional neural network model. In the experiments, ResNet is used as the example model to be explained, and ImageNet is selected as the benchmark dataset; DenseNet and MobileNet are further explained to demonstrate the model-agnostic characteristic of the proposed method. The evolved local explanations on four images randomly selected from ImageNet are presented, showing that they are straightforward for humans to recognise. Moreover, the evolved explanations explain the predictions of deep convolutional neural networks on all four images very well by successfully capturing meaningful, interpretable features of the sample images. Further analysis based on 30 runs of the experiments shows that the evolved local explanations can also improve the confidence of the deep convolutional neural network models in making their predictions. The proposed method obtains local explanations within one minute, more than ten times faster than LIME, the state-of-the-art method.
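The model-agnostic claim hinges on the genetic algorithm needing nothing but a black-box score from the classifier. The minimal GA below evolves a binary mask over image regions to maximize such a score; its operators (tournament-style selection, uniform crossover, bit-flip mutation) and the toy scoring function are assumptions for illustration, not the paper's exact design.

```python
import random

def evolve_explanation(score, n_regions, pop=20, gens=30, seed=0):
    """Evolve a binary mask over image regions that maximizes a black-box
    score(mask) -> float. Model-agnostic: only the score is ever queried."""
    rng = random.Random(seed)
    popn = [[rng.randint(0, 1) for _ in range(n_regions)] for _ in range(pop)]
    for _ in range(gens):
        ranked = sorted(popn, key=score, reverse=True)
        next_pop = ranked[:2]                      # elitism: keep the best two
        while len(next_pop) < pop:
            a, b = rng.sample(ranked[:10], 2)      # select parents from top half
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            if rng.random() < 0.3:
                child[rng.randrange(n_regions)] ^= 1  # bit-flip mutation
            next_pop.append(child)
        popn = next_pop
    return max(popn, key=score)

# Hypothetical black box: "the model" responds to regions 2 and 5 only,
# with a small sparsity penalty so minimal masks are preferred.
best = evolve_explanation(lambda m: m[2] + m[5] - 0.1 * sum(m), n_regions=8)
```

In a real setting the lambda would be replaced by the classifier's confidence on the masked image, so the same loop explains ResNet, DenseNet, or MobileNet unchanged.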
Generalized Category Discovery (GCD) aims to recognize both known and novel categories in a set of unlabeled data, given another dataset labeled with only the known categories. Without considering the differences between known and novel categories, current methods learn them in a coupled manner, which can hurt the model's generalization and discriminative ability. Furthermore, coupled training prevents these models from explicitly transferring category-specific knowledge from labeled to unlabeled data, which can lose high-level semantic information and impair performance. To mitigate these limitations, we present a novel model called the Decoupled Prototypical Network (DPN). By formulating a bipartite matching problem over category prototypes, DPN not only decouples known and novel categories to pursue different training targets effectively, but also aligns known categories across labeled and unlabeled data to transfer category-specific knowledge explicitly and capture high-level semantics. Furthermore, DPN learns more discriminative features for both known and novel categories through our proposed Semantic-aware Prototypical Learning (SPL). Besides capturing meaningful semantic information, SPL alleviates the noise of hard pseudo-labels through semantic-weighted soft assignment. Extensive experiments show that DPN outperforms state-of-the-art models by a large margin on all evaluation metrics across multiple benchmark datasets. Code and data are available at https://github.com/Lackel/DPN.
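The prototype-matching step can be sketched concretely: match each labeled-class prototype to one unlabeled-cluster prototype so that total similarity is maximized, and treat the leftover unlabeled prototypes as novel categories. The brute-force search and cosine criterion below are a small illustrative stand-in (real implementations would use the Hungarian algorithm); the prototype vectors are made up.

```python
from itertools import permutations
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def match_prototypes(labeled, unlabeled):
    """Exhaustively assign each labeled prototype a distinct unlabeled
    prototype (|unlabeled| >= |labeled|), maximizing total cosine similarity.
    Unmatched unlabeled prototypes are decoupled as novel categories."""
    best, best_sim = None, -math.inf
    for perm in permutations(range(len(unlabeled)), len(labeled)):
        sim = sum(cosine(l, unlabeled[j]) for l, j in zip(labeled, perm))
        if sim > best_sim:
            best, best_sim = perm, sim
    novel = [j for j in range(len(unlabeled)) if j not in best]
    return list(best), novel

labeled = [(1.0, 0.0), (0.0, 1.0)]                  # known-class prototypes
unlabeled = [(0.1, 0.9), (0.9, 0.2), (-1.0, -1.0)]  # mixed known + novel clusters
matched, novel = match_prototypes(labeled, unlabeled)
```

Here the matching pairs each known prototype with its nearest unlabeled cluster, and the remaining cluster is flagged as novel, which is the decoupling the abstract describes.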
Reinforcement learning (RL) is an agent-based approach for teaching robots to navigate in the physical world. Gathering data for RL is known to be a laborious task, and real-world experiments can be risky. Simulators facilitate the collection of training data in a quicker and more cost-effective manner. However, RL frequently requires a significant number of simulation steps before an agent becomes proficient at even simple tasks. This is a prevalent issue in the field of RL-based visual quadrotor navigation, where the state dimensions are typically very large and the dynamic models are complex. Furthermore, rendering images and obtaining the physical properties of the agent can be computationally expensive. To solve this, we present a simulation framework, built on AirSim, that provides efficient parallel training. On top of this framework, Ape-X is modified to incorporate decentralised training over AirSim environments in order to make use of numerous networked computers. Through experiments, we were able to reduce training time from 3.9 hours to 11 minutes using this framework with a total of 74 agents and two networked computers. Further details and a video of our project, PRL4AirSim, can be found at https://sites.google.com/view/prl4airsim/home.
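The Ape-X-style decentralised pattern, many actors feeding one shared replay store, can be sketched with threads and a queue. Everything here is a stand-in: the environment is a stubbed one-dimensional walk, not AirSim, and real deployments would use processes or networked machines rather than threads.

```python
import queue
import random
import threading

def actor(actor_id, replay, n_steps):
    """One decentralised actor: steps a stubbed environment and pushes
    (id, s, a, r, s') transitions onto the shared replay queue."""
    rng = random.Random(actor_id)
    state = 0.0
    for _ in range(n_steps):
        action = rng.choice([-1.0, 1.0])           # stand-in policy
        next_state = state + action
        reward = -abs(next_state)                  # reward: stay near origin
        replay.put((actor_id, state, action, reward, next_state))
        state = next_state

replay = queue.Queue()                             # thread-safe shared buffer
threads = [threading.Thread(target=actor, args=(i, replay, 50))
           for i in range(4)]                      # 4 parallel actors
for t in threads:
    t.start()
for t in threads:
    t.join()
transitions = [replay.get() for _ in range(replay.qsize())]
```

A central learner would consume `transitions` in batches while the actors keep collecting, which is where the reported wall-clock speed-up comes from.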
We present an end-to-end lecture video generation system that can generate realistic and complete lecture videos directly from annotated slides, the instructor's reference voice, and the instructor's reference portrait video. Our system mainly consists of a speech synthesis module with few-shot speaker adaptation and a talking-head generation module based on adversarial learning. It is capable not only of reducing the instructor's workload but also of changing the language and accent, which can help students follow the lecture more easily and enable wider dissemination of the lecture content. Our experimental results show that the proposed model outperforms other current approaches in terms of authenticity, naturalness, and accuracy. A video demonstration of how our system works, along with the results of the evaluations and comparisons, is available at https://youtu.be/cy6tyki0cog.
The growing abundance of learning traces on online learning platforms promises unique insights into learner knowledge assessment (LKA), a fundamental personalized-tutoring technique that enables various further adaptive tutoring services on these platforms. Precise assessment of learner knowledge requires a fine-grained Q-matrix, which is typically designed by experts to map items to skills in the domain. Owing to subjective tendencies, some misspecifications may degrade the performance of LKA. Some efforts have been made to refine small-scale Q-matrices; however, it is difficult to scale these methods and apply them to large-scale online learning contexts with numerous items and vast numbers of skills. Moreover, existing LKA models employ flexible deep-learning models that excel at this task, but the adequacy of LKA is still challenged by the models' limited representation capability on the quite sparse item-skill graph and on learners' exercise data. To overcome these issues, in this paper we propose a prerequisite-driven Q-matrix refinement framework for learner knowledge assessment (PQRLKA) in online contexts. We infer prerequisites from learners' response data and use them to refine the expert-defined Q-matrix, making the refinement interpretable and scalable to large-scale online learning contexts. Based on the refined Q-matrix, we propose a Metapath2Vec-enhanced convolutional representation method to obtain comprehensive, information-rich representations of the items, and feed them to the PQRLKA model to finally assess learners' knowledge. Experiments on three real-world datasets demonstrate the capability of our model to infer prerequisites for Q-matrix refinement, and its superiority on the LKA task.
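The refinement step can be sketched as a transitive-closure operation: if skill `p` is a prerequisite of skill `s`, then an item testing `s` implicitly involves `p`, so `p` is added to that item's row of the Q-matrix. This is a deliberately simplified illustration; the paper's contribution is inferring the prerequisite relation from response data, which is taken as given here, and the skill names are invented.

```python
def refine_q_matrix(q, prerequisites):
    """Augment each item's skill set with the transitive prerequisites of
    its skills. `q` maps item -> set of skills; `prerequisites` is a list
    of (p, s) pairs meaning skill p is a prerequisite of skill s."""
    pre = {}
    for p, s in prerequisites:
        pre.setdefault(s, set()).add(p)
    refined = {}
    for item, skills in q.items():
        closed, stack = set(skills), list(skills)
        while stack:                               # walk the prerequisite chain
            s = stack.pop()
            for p in pre.get(s, ()):
                if p not in closed:
                    closed.add(p)
                    stack.append(p)
        refined[item] = closed
    return refined

# Hypothetical domain: counting precedes addition, which precedes fractions.
q = {"item1": {"fractions"}, "item2": {"addition"}}
refined = refine_q_matrix(q, [("counting", "addition"),
                              ("addition", "fractions")])
```

The refined matrix densifies the item-skill graph, which is what makes the downstream representation learning better informed.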